There are 159 examples not solved by any model.
If these are sound problems, solving some of them is a good signal that your model is indeed better than the leading models.
DS/105, DS/106, DS/107, DS/108, DS/121, DS/122, DS/131, DS/132, DS/142, DS/15, DS/159, DS/172, DS/173, DS/174, DS/178, DS/197, DS/202, DS/203, DS/204, DS/205, DS/208, DS/209, DS/210, DS/211, DS/216, DS/225, DS/228, DS/236, DS/242, DS/244, DS/245, DS/250, DS/26, DS/263, DS/269, DS/270, DS/272, DS/280, DS/284, DS/285, DS/286, DS/29, DS/318, DS/319, DS/328, DS/338, DS/339, DS/345, DS/354, DS/362, DS/372, DS/373, DS/374, DS/375, DS/385, DS/387, DS/389, DS/390, DS/394, DS/40, DS/407, DS/408, DS/410, DS/411, DS/416, DS/418, DS/42, DS/420, DS/421, DS/427, DS/43, DS/439, DS/44, DS/440, DS/45, DS/458, DS/46, DS/463, DS/468, DS/48, DS/488, DS/509, DS/515, DS/516, DS/521, DS/526, DS/54, DS/56, DS/567, DS/57, DS/571, DS/58, DS/585, DS/59, DS/596, DS/6, DS/60, DS/612, DS/621, DS/622, DS/626, DS/635, DS/638, DS/65, DS/67, DS/672, DS/694, DS/699, DS/7, DS/701, DS/726, DS/73, DS/74, DS/744, DS/747, DS/75, DS/755, DS/763, DS/772, DS/773, DS/775, DS/776, DS/779, DS/780, DS/781, DS/789, DS/79, DS/790, DS/798, DS/799, DS/8, DS/80, DS/808, DS/809, DS/81, DS/815, DS/86, DS/877, DS/879, DS/88, DS/883, DS/884, DS/885, DS/9, DS/90, DS/900, DS/901, DS/904, DS/905, DS/922, DS/926, DS/927, DS/953, DS/96, DS/984, DS/987, DS/993, DS/997, DS/998
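
A list like the one above can be derived from per-model pass/fail records. The following is a minimal sketch, not the report's actual pipeline: the `results` mapping, the model names, and the example IDs in it are hypothetical placeholders.

```python
# Hypothetical input: model name -> {example id -> whether the model solved it}.
results = {
    "model_a": {"DS/1": True, "DS/2": False, "DS/3": False},
    "model_b": {"DS/1": False, "DS/2": True, "DS/3": False},
}


def unsolved_examples(results):
    """Return the sorted ids of examples that no model solved."""
    all_ids = set()
    solved = set()
    for per_example in results.values():
        for example_id, passed in per_example.items():
            all_ids.add(example_id)
            if passed:
                solved.add(example_id)
    # Plain string sort, so "DS/15" sorts before "DS/159" and "DS/2".
    return sorted(all_ids - solved)


unsolved = unsolved_examples(results)
print(f"There are {len(unsolved)} examples not solved by any model.")
print(", ".join(unsolved))
```
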
| example_link | model | min_elo |
|---|---|---|
| DS/104 | gpt-4-turbo-2024-04-09 | 1197.352 |
| DS/946 | gpt-4-turbo-2024-04-09 | 1197.352 |
| DS/129 | gpt-4-turbo-2024-04-09 | 1197.352 |
| DS/348 | gpt-4-turbo-2024-04-09 | 1197.352 |
| DS/282 | gpt-4-turbo-2024-04-09 | 1197.352 |
| DS/253 | gpt-4-turbo-2024-04-09 | 1197.352 |
| DS/743 | gpt-4-turbo-2024-04-09 | 1197.352 |
| DS/304 | gpt-4-turbo-2024-04-09 | 1197.352 |
| DS/677 | gpt-4-turbo-2024-04-09 | 1197.352 |
| DS/222 | gpt-4-turbo-2024-04-09 | 1197.352 |
| DS/505 | gpt-4-turbo-2024-04-09 | 1197.352 |
| DS/357 | gpt-4-turbo-2024-04-09 | 1197.352 |
| DS/130 | gpt-4-0613 | 1149.500 |
| DS/774 | gpt-4-0613 | 1149.500 |
| DS/422 | gpt-4-0613 | 1149.500 |
| DS/765 | gpt-4-0613 | 1149.500 |
| DS/134 | gpt-4-0613 | 1149.500 |
| DS/784 | gpt-4-0613 | 1149.500 |
| DS/386 | gpt-4-0613 | 1149.500 |
| DS/903 | gpt-4-0613 | 1149.500 |
| DS/39 | gpt-4-0613 | 1149.500 |
| DS/154 | gpt-4-0613 | 1149.500 |
| DS/807 | gpt-4-0613 | 1149.500 |
| DS/902 | gpt-4-0613 | 1149.500 |
| DS/243 | deepseek-ai-deepseek-coder-6.7b-instruct | 1097.705 |
| DS/66 | deepseek-ai-deepseek-coder-6.7b-instruct | 1097.705 |
| DS/750 | deepseek-ai-deepseek-coder-6.7b-instruct | 1097.705 |
| DS/749 | deepseek-ai-deepseek-coder-6.7b-instruct | 1097.705 |
| DS/751 | deepseek-ai-deepseek-coder-6.7b-instruct | 1097.705 |
| DS/995 | microsoft-wavecoder-ultra-6.7b | 1093.117 |
| DS/679 | microsoft-wavecoder-ultra-6.7b | 1093.117 |
| DS/681 | microsoft-wavecoder-ultra-6.7b | 1093.117 |
| DS/201 | m-a-p-OpenCodeInterpreter-DS-6.7B | 1045.520 |
| DS/806 | m-a-p-OpenCodeInterpreter-DS-6.7B | 1045.520 |
| DS/671 | m-a-p-OpenCodeInterpreter-DS-6.7B | 1045.520 |
| DS/55 | codex002 | 1016.940 |
| DS/87 | gpt-3.5-turbo-0125 | 1003.249 |
| DS/582 | gpt-3.5-turbo-0125 | 1003.249 |
| DS/51 | m-a-p-OpenCodeInterpreter-CL-7B | 1002.053 |
| DS/447 | m-a-p-OpenCodeInterpreter-CL-7B | 1002.053 |
| DS/764 | gpt-3.5-turbo-0613 | 1000.000 |
| DS/153 | gpt-3.5-turbo-0613 | 1000.000 |
| DS/813 | gpt-3.5-turbo-0613 | 1000.000 |
| DS/200 | m-a-p-OpenCodeInterpreter-SC2-3B | 988.599 |
| DS/199 | m-a-p-OpenCodeInterpreter-SC2-3B | 988.599 |
| DS/93 | m-a-p-OpenCodeInterpreter-SC2-3B | 988.599 |
| DS/388 | m-a-p-OpenCodeInterpreter-SC2-7B | 973.177 |
| DS/604 | m-a-p-OpenCodeInterpreter-SC2-7B | 973.177 |
| DS/812 | m-a-p-OpenCodeInterpreter-SC2-7B | 973.177 |
| DS/766 | m-a-p-OpenCodeInterpreter-SC2-7B | 973.177 |
| DS/157 | ibm-granite-granite-8b-code-base | 945.362 |
| DS/188 | ibm-granite-granite-8b-code-base | 945.362 |
| DS/227 | meta-llama-Meta-Llama-3-8B | 913.471 |
| DS/899 | meta-llama-Meta-Llama-3-8B | 913.471 |
| DS/305 | deepseek-ai-deepseek-coder-6.7b-base | 905.459 |
| DS/654 | meta-llama-CodeLlama-13b-Python-hf | 901.914 |
| DS/474 | microsoft-wavecoder-pro-6.7b | 901.063 |
| DS/224 | microsoft-wavecoder-pro-6.7b | 901.063 |
| DS/996 | microsoft-wavecoder-pro-6.7b | 901.063 |
| DS/241 | ibm-granite-granite-8b-code-instruct | 901.060 |
| DS/240 | ibm-granite-granite-8b-code-instruct | 901.060 |
| DS/462 | google-codegemma-1.1-7b-it | 882.622 |
| DS/346 | meta-llama-Meta-Llama-3-8B-Instruct | 876.276 |
| DS/164 | meta-llama-CodeLlama-7b-Python-hf | 827.569 |
| DS/95 | claude-3-sonnet-20240229 | 789.836 |
| DS/27 | claude-3-sonnet-20240229 | 789.836 |
| DS/867 | meta-llama-CodeLlama-7b-hf | 774.037 |
| DS/161 | meta-llama-CodeLlama-7b-hf | 774.037 |
| DS/264 | gpt-4o-2024-05-13 | 751.120 |
| DS/281 | gpt-4o-2024-05-13 | 751.120 |
| DS/887 | microsoft-phi-2 | 750.630 |
| DS/533 | Qwen-CodeQwen1.5-7B-Chat | 737.485 |
| DS/165 | mistralai-Mistral-7B-Instruct-v0.2 | 730.948 |
| DS/64 | Salesforce-codegen25-7b-instruct_P | 705.113 |
| DS/886 | google-codegemma-1.1-2b | 661.904 |
| DS/424 | google-codegemma-2b | 594.298 |
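
One way to read the table above is that `min_elo` is the Elo rating of the lowest-rated model that solves each example. That reading is an assumption based on the column name. Under it, the table could be produced roughly as follows; the `elo` ratings, the `results` records, and the model names are all hypothetical.

```python
# Hypothetical Elo ratings and pass/fail records; not taken from the report.
elo = {"model_a": 1200.0, "model_b": 1000.0, "model_c": 800.0}
results = {
    "model_a": {"DS/1": True, "DS/2": True, "DS/3": False},
    "model_b": {"DS/1": False, "DS/2": True, "DS/3": False},
    "model_c": {"DS/1": False, "DS/2": True, "DS/3": False},
}


def min_elo_to_solve(results, elo):
    """Map each solved example to (model, elo) of its lowest-rated solver."""
    best = {}
    for model, per_example in results.items():
        for example_id, passed in per_example.items():
            if not passed:
                continue
            if example_id not in best or elo[model] < best[example_id][1]:
                best[example_id] = (model, elo[model])
    return best


min_elo = min_elo_to_solve(results, elo)

# Print rows in descending min_elo order, matching the layout of the table.
print("| example_link | model | min_elo |")
print("|---|---|---|")
for example_id, (model, rating) in sorted(
    min_elo.items(), key=lambda item: item[1][1], reverse=True
):
    print(f"| {example_id} | {model} | {rating:.3f} |")
```

Examples that no model solves (here `DS/3`) have no minimum Elo and are left out of the mapping.
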
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | acc | tau |
|---|---|---|
| DS/880 | 0.302 | -0.310 |
| DS/392 | 0.429 | -0.225 |
| DS/882 | 0.111 | -0.189 |
| DS/470 | 0.032 | -0.164 |
| DS/611 | 0.286 | -0.132 |
| DS/424 | 0.016 | -0.127 |
| DS/523 | 0.032 | -0.123 |
| DS/886 | 0.016 | -0.109 |
| DS/64 | 0.016 | -0.095 |
| DS/881 | 0.270 | -0.092 |
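
The `tau` column above is a rank correlation, but the report does not say which variant is used. The sketch below assumes Kendall's tau-b, computed between each model's overall score and whether that model solves the problem. The function name and the sample data are hypothetical, and returning 0.0 when the correlation is undefined is a convention chosen here.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: contributes to neither factor
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    # Pairs untied in x times pairs untied in y.
    untied_x = concordant + discordant + tied_y_only
    untied_y = concordant + discordant + tied_x_only
    denom = sqrt(untied_x * untied_y)
    # Assumed convention: report 0.0 when every pair is tied in x or in y.
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: overall score per model, and 1/0 for solving one problem.
overall = [0.9, 0.7, 0.5, 0.3]
passed = [0, 0, 1, 1]  # only the weaker models solve this problem

tau = kendall_tau_b(overall, passed)
print(f"tau = {tau:.3f}")
```

A negative value, as in this toy case, means the models with lower overall scores are the ones solving the problem, which is the pattern the table singles out.
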
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.